Abstract

As  humans,  we  develop  the  ability  to  identify  people  by  their  voice  at  an  early  age. 
Getting  computers  to  perform  the  same  task  has  proven  to  be  an  interesting  problem. 
Speaker recognition involves two applications, speaker identification and speaker
verification. Both applications are examined in this effort.

Two methods are employed to perform speaker recognition. The first is an
enhancement of hidden Markov models. Rather than alter some part of the model itself, a
single-layer perceptron is added to perform neural post-processing. The second solution is
the novel application of an enhanced Feature Space Trajectory Neural Network to speaker
recognition. The Feature Space Trajectory was developed for temporal recognition in image
processing and has been demonstrated to outperform the hidden Markov model for
some image sequence applications.

Neural post-processing of hidden Markov models is shown to improve performance
of both aspects of speaker recognition by increasing the identification rate from 70.23% to
88.44% and reducing the Equal Error Rate from 3.38% to 1.56%. In addition, a new method
of cohort selection is implemented based on the structure of the single-layer perceptron.

Feasibility of using Feature Space Trajectory Neural Networks for speaker recognition
is demonstrated. Favorable identification results of 65.52% are obtained when using a
large training database. The FST configurations tested outperformed a comparable HMM
system by 12-24%.

Speaker Recognition by Hidden Markov Models and Neural Networks


I.  Introduction 

1.1  Background 

As  humans,  we  have  developed  the  ability  to  identify  people  by  merely  hearing  their 
voices.  We  can  do  this  if  they  are  in  the  same  room  with  us,  down  the  hall,  on  the 
telephone,  or  even  talking  through  a  personal  address  system.  What  makes  this  possible? 
What  does  our  brain  use  to  discriminate  one  person’s  voice  from  another? 

It is easy to understand how to differentiate a male speaker from a female speaker
because, in most cases, the male’s voice has a lower pitch. The problem becomes more
difficult when trying to discriminate one particular male from a group of all male speakers.
Maybe we can use the fact that one speaker has a Southern accent while the others do
not. It could also be the case that the speaker pronounces certain words differently than
other speakers. We have developed this discriminative ability and use it without giving it
much thought, yet programming a computer or machine to perform the same task has been
difficult.

Speech has a temporal component in that when a given word is spoken, sounds must
be in a certain order. If these same sounds are produced in permuted order, they will
not produce the same word. This temporal information will be the basis of the methods
that are used in this thesis for speaker recognition. Just as sounds must be ordered to
represent a given word, the manner in which a speaker produces that order
is also temporally based. A good illustration is people with a Southern accent. Their
pronunciation of certain words may be extended relative to that of New Englanders.

Speaker  recognition  research  is  divided  into  two  applications.  The  first  is  speaker 
identification  where  given  a  sample  of  speech,  the  system  finds  the  closest  match  in  the 
database  and  reports  it  as  the  result.  The  second  is  speaker  verification  where  someone 
makes  a  claim  about  their  identity,  and  the  system  determines  if  the  claim  is  valid.  This 
thesis  will  address  both  areas. 

1.2  Problem  Statement 

Develop and compare the performance of two temporally based speaker recognition
systems: hidden Markov models and Feature Space Trajectory (FST) Neural Networks,
each using neural post-processing.

1.3  Scope 

The data used in this thesis is from the YOHO database. YOHO speech utterances
are from a real-world office environment collected using a high-quality telephone handset.
Each utterance consists of a combination lock phrase of the form “twenty-four, sixty-seven,
eighty-two” with 8 kHz sampling and 3.8 kHz bandwidth [1].

This database was developed by ITT and is the largest supervised database of its
type. YOHO has been configured to allow testing at the 75% confidence level for
determination of meeting the 0.1% false rejection and 1.0% false acceptance criteria. This
database contains 138 speakers (32 females and 106 males) from which data was collected
over 14 sessions for each speaker [1]. For the purposes of this thesis, only the 32 females
will be used. Time constraints drove the need to work on a subset of the database. The
entire set of females was chosen because it represents a much more complex problem than
performing speaker recognition on a database of 16 males and 16 females.

Speaker recognition performance will be reported in terms of Equal Error Rate
(EER). EER is defined as the operating point at which the number of False Acceptances
equals the number of False Rejections made by a system.
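As a rough illustration of how an EER can be estimated from verification scores, the sketch below sweeps a decision threshold over a set of true-speaker and impostor scores and reports the point where the false acceptance and false rejection rates are closest. The function name, the accept-if-score-at-or-above-threshold convention, and the use of rates rather than raw counts are assumptions made for illustration, not the scoring procedure used in this thesis.

```python
import numpy as np

def equal_error_rate(true_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= threshold (assumed convention).
    """
    true_scores = np.asarray(true_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([true_scores, impostor_scores]))

    best_gap, eer = np.inf, None
    for th in thresholds:
        far = np.mean(impostor_scores >= th)   # false acceptance rate
        frr = np.mean(true_scores < th)        # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:                     # keep the point where FAR and FRR are closest
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

With finite score sets the two rates rarely cross exactly, so the sketch returns their average at the closest point.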

1.4  Approach 

The  first  step  is  the  development  of  a  simple,  word-based  HMM  speaker  recognition 
system.  The  single-layer  perceptron  (SLP)  post-processor  is  also  developed  to  determine 
if  it  provides  any  enhancement.  The  second  step  is  the  development  of  a  word-based 
FST  speaker  recognition  system.  Once  developed,  the  SLP  post-processor  is  applied  to 
determine  the  improvement  in  performance. 

1.5  Thesis  Organization 

Chapter II provides background information on the methods used in this thesis.
Chapter III contains a description of the methodology used in the accomplishment of this
research. Chapter IV contains the results, and Chapter V contains a discussion of the
results and suggestions for future work.

II. Background


2.1  Introduction 

This chapter provides the necessary background information to understand the
methods used in this thesis for the speaker recognition problem. To start, the YOHO database
will  be  described  along  with  feature  generation.  Next,  hidden  Markov  models  (HMM) 
will  be  discussed  in  detail  as  well  as  the  Feature  Space  Trajectory  (FST)  Neural  Network 
developed  by  Neiberg  and  Casasent  [2-7].  In  addition,  the  single-layer  perceptron  will  be 
discussed  due  to  its  use  as  a  post-processor  following  both  the  HMM  and  the  FST. 

2.2  YOHO  Database 

The  YOHO  database  [1]  was  developed  by  ITT  and  is  available  from  the  Linguistic 
Data  Consortium.  It  was  created  as  a  standard  to  be  used  in  the  development  of  speaker 
verification systems. The database contains speech from 138 speakers (32 female and 106
male) in the form of combination lock phrases. Each phrase consists of three numbers
in the form: “ninety-seven, sixty-three, twenty-four.” The vocabulary has been limited
such that no value under twenty is permitted, the number eight cannot be used, doublets
(twenty-two, thirty-three, etc.) are excluded, and the decade numbers (twenty, thirty, etc.)
are not allowed.

The data was collected over a 3-month period in a real-world office environment. A
telephone handset was used to collect the data with 8 kHz sampling and 3.8 kHz
bandwidth [1]. Each subject took part in 4 enrollment sessions with 24 phrases per session and
10 verification sessions with 4 phrases per session.

2.3 Mel-Frequency Cepstral Coefficients


Mel-Frequency Cepstral Coefficients (MFCC) are used as features [8,9]. Coefficients
are obtained from analysis frames 20 msec in length at 10 msec intervals. Twenty-four
Mel frequency spectral coefficients (MFSC) are generated by twenty-four triangular filters.
The filters are spaced linearly below 1 kHz and logarithmically above. The resulting
twenty-four MFSC are reduced to twelve MFCC through application of a Discrete Cosine
Transform. Log energy is appended to the twelve MFCCs for a baseline feature set of
thirteen dimensions. The transitional coefficients delta and delta-delta are also appended
to provide a thirty-nine dimensional feature vector for each frame, as shown below.

$$\mathbf{x}_{1-39} = [\, \mathrm{MFCC}_{1-12} \quad \mathrm{LogEnergy}_{13} \quad \Delta_{14-26} \quad \Delta\Delta_{27-39} \,] \tag{1}$$
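A minimal sketch of how the thirty-nine dimensional vector could be assembled from the thirteen baseline features is given below. The text does not specify how the delta coefficients are computed, so `np.gradient` (a simple central difference over frames) is used purely as a stand-in, and the function name `add_deltas` is hypothetical.

```python
import numpy as np

def add_deltas(base):
    """Stack baseline features with first and second time differences.

    base: array of shape (num_frames, 13), i.e. 12 MFCCs plus log energy.
    np.gradient stands in for the delta computation, which is an assumption.
    """
    base = np.asarray(base, dtype=float)
    delta = np.gradient(base, axis=0)          # frame-to-frame slope of each feature
    delta_delta = np.gradient(delta, axis=0)   # slope of the slope
    return np.hstack([base, delta, delta_delta])
```

Each output row follows the layout of the feature vector above: columns 1-13 hold the baseline features, 14-26 the deltas, and 27-39 the delta-deltas.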

2.4  Hidden  Markov  Models 

The hidden Markov model (HMM) is a probabilistic technique for the modeling
of temporal data [10]. HMMs consist of states that can be interconnected in different
manners. Two ways that states may be connected include ergodic and left-to-right. The
ergodic model is fully interconnected, meaning a transition may occur to any other state.
The left-to-right, or Bakis, model has the constraint of starting in the first state and
finishing in the final state without going backwards.

There also exist two subtypes of HMMs, discrete and continuous, that describe the
systems they are attempting to model. Discrete HMMs are characterized by a finite
observation symbol alphabet and corresponding probability mass function. Continuous HMMs
are characterized by modeling the observations as continuous random variables with
associated probability density functions.

2.4.1 Hidden Markov Model Parameters. The number of states and structure
of the HMM are important characteristics, but there are additional parameters that are
required to define an HMM.

1. $N$ - The number of states of the model.

2. $M$ - The number of observation symbols in each state.

3. $\pi$ - Initial state distribution. Each entry corresponds to the probability of being in
state $q_i$ for the initial observation.

4. $A$ - Transition Matrix ($N \times N$). Each entry ($a_{ij}$) corresponds to the probability of
transitioning from state $q_i$ to state $q_j$.

5. $B$ - Observation Symbol Distribution Matrix ($N \times M$). Each entry ($b_i(k)$) corresponds
to the probability of being in state $q_i$ and observing the symbol $k$.

Since the parameters $N$ and $M$ can be derived from the matrix dimensions, only
the parameters $\pi$, $A$, and $B$ are required to completely define an HMM. This definition is
usually in the form of $\lambda = (\pi, A, B)$ [11].

2.5  Hidden  Markov  Model  Building  Blocks 

There are three basic problems which must be solved to apply HMMs to any task [11].
The first is that given an HMM, $\lambda$, and an observation sequence, $O = \{o_1, o_2, \ldots, o_T\}$,
what is the probability that the sequence came from this model, $P(O|\lambda)$? For example, in a
speaker identification test, each speaker has a model and identification is accomplished by
finding the best fitting model, as determined by the highest probability. The second is to
find the optimal state sequence, $q$, given $O$ and $\lambda$. In our speaker identification example, this
corresponds to finding the path through a given speaker’s model that results in the highest
probability. The third problem is to adapt the parameters of a given HMM, $\lambda = (\pi, A, B)$,
to maximize $P(O|\lambda)$. This problem is often referred to as training the HMM. In speaker
identification, as with most other HMM applications, this is a crucial aspect. A poorly
trained system produces poor results. The following algorithms solve these three problems.

2.5.1 Forward Algorithm. One way to calculate $P(O|\lambda)$ is the Forward
Algorithm. In this procedure, given $O$ and $\lambda$, you start with the first observation, $o_1$, and work
“forward” through the data sequence until the end is reached. This produces a probability
which is the sum over all possible state sequences evaluated at the final observation, $o_T$.
The algorithm is shown below [11].

1. Initialization

$$\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N \tag{2}$$

2. Induction

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1}), \quad 1 \le t \le T-1, \quad 1 \le j \le N \tag{3}$$

3. Termination

$$P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{4}$$

4. NOTE: The forward variable is defined as $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = i \,|\, \lambda)$, which
represents the probability of the observation sequence $\{o_1, o_2, \ldots, o_t\}$ and state $i$ at time
$t$ for the given model $\lambda$.
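The initialization, induction, and termination steps above can be sketched for a discrete HMM as follows. The function name and the array layout (`B[i, k]` holding the probability of symbol `k` in state `i`) are assumptions chosen for illustration; a practical implementation would work with scaled or log values to avoid underflow.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return P(O | lambda) for a discrete HMM via the forward algorithm.

    pi:  (N,) initial state distribution
    A:   (N, N) transition matrix, A[i, j] = a_ij
    B:   (N, M) observation matrix, B[i, k] = b_i(k)
    obs: sequence of observation symbol indices o_1 ... o_T
    """
    pi, A, B = (np.asarray(x, dtype=float) for x in (pi, A, B))
    alpha = pi * B[:, obs[0]]              # initialization: alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction: sum over predecessors, then emit
    return float(alpha.sum())              # termination: sum of alpha_T(i)
```

For a two-state toy model with `pi = [0.6, 0.4]`, `A = [[0.7, 0.3], [0.4, 0.6]]`, `B = [[0.5, 0.5], [0.1, 0.9]]` and the observation sequence `[0, 1]`, the forward variables are `[0.3, 0.04]` and then `[0.113, 0.1026]`, giving a sequence probability of 0.2156.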

2.5.2  Backward  Algorithm.  Instead  of  beginning  with  the  first  observation,  we 
can  start  with  the  final  observation  and  work  our  way  “backward”.  This  procedure  is 
primarily  used  to  produce  the  backward  variable  for  use  in  training  the  HMM: 


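$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \,|\, q_t = i, \lambda) \tag{5}$$

The backward variable is defined as the probability of the observation sequence $\{o_{t+1}, o_{t+2}, \ldots, o_T\}$
given state $q_t = i$ and a model $\lambda$. Calculation of the backward variable follows [11].

1. Initialization

$$\beta_T(i) = 1, \quad 1 \le i \le N \tag{6}$$

2. Induction

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, T-2, \ldots, 1, \quad 1 \le i \le N \tag{7}$$

3. Termination

$$P(O|\lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) \tag{8}$$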
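A corresponding sketch of the backward procedure for a discrete HMM is shown below. As with any illustration of this kind, the function name and array layout are assumptions; the returned table of backward variables is what a training procedure would consume.

```python
import numpy as np

def backward(pi, A, B, obs):
    """Return (beta, P(O | lambda)) for a discrete HMM.

    beta has shape (T, N); row t holds the backward variable for time t + 1
    (rows are zero-indexed, times in the text are one-indexed).
    """
    pi, A, B = (np.asarray(x, dtype=float) for x in (pi, A, B))
    T = len(obs)
    beta = np.ones((T, len(pi)))                          # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])    # induction, moving backward in time
    prob = float(np.sum(pi * B[:, obs[0]] * beta[0]))     # termination
    return beta, prob
```

The termination value equals the probability produced by the forward procedure for the same model and observations, which makes a convenient consistency check.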

2.5.3 Viterbi Algorithm. The Viterbi Algorithm is used to find the most likely
state sequence using an algorithm similar to that of the forward procedure. The difference
is that instead of keeping track of probabilities for every possible path, it is only concerned
with the best path. An array is used to store the state sequence that corresponds to the
optimal probability. The general procedure for the Viterbi Algorithm is found below [11]:

1. Initialization

$$\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N \tag{9}$$

$$\psi_1(i) = 0 \tag{10}$$

NOTE: $\delta_t(i)$ represents the highest probability along a single path that accounts for the
observations $\{o_1, o_2, \ldots, o_t\}$ and ends in state $i$ at time $t$. $\psi_t(i)$ is an array that holds
the argument which maximizes $\delta$ for each $t$ and $q_t$.

2. Recursion

$$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(o_t), \quad 2 \le t \le T, \quad 1 \le j \le N \tag{11}$$

$$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}], \quad 2 \le t \le T, \quad 1 \le j \le N \tag{12}$$

3. Termination

$$P^* = \max_{1 \le i \le N} [\delta_T(i)] \tag{13}$$

$$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)] \tag{14}$$

4. Path (state sequence) backtracking

$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1 \tag{15}$$

In most cases, HMMs are implemented using log values to avoid numerical underflow on the host
machine. Underflow is caused by the extremely small probabilities that occur when dealing
with large models and large amounts of input data.
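The sketch below combines the Viterbi procedure with the log-domain trick just described: products of probabilities become sums of logs, so long observation sequences do not underflow. It assumes a discrete HMM and uses hypothetical names; zero probabilities map to negative infinity, which is harmless under max and addition.

```python
import numpy as np

def viterbi_log(pi, A, B, obs):
    """Most likely state sequence and its log probability (discrete HMM)."""
    with np.errstate(divide="ignore"):        # log(0) = -inf is acceptable here
        log_pi, log_A, log_B = (np.log(np.asarray(x, dtype=float)) for x in (pi, A, B))
    T, N = len(obs), len(log_pi)
    delta = log_pi + log_B[:, obs[0]]         # initialization
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)        # best predecessor for each state j
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta.argmax())]              # termination: best final state
    for t in range(T - 1, 0, -1):             # backtracking through psi
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())
```

States in the returned path are zero-indexed.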

2.5.3.1 Forced Viterbi Alignment. The Viterbi alignment procedure
described above uses all possible state sequences to determine its score. A different approach
is to ‘force’ only certain allowed state sequences. For example, assume an HMM system
with  word  level  models  using  the  YOHO  database.  With  the  normal  Viterbi  procedure 
described  in  Section  2.5.3,  all  possible  combinations  of  word  models  will  be  examined  to 
determine  the  best  match.  In  forced  Viterbi  alignment,  only  the  words  corresponding  to 
the  utterance  will  be  used  to  determine  the  score.  This  research  will  use  forced  Viterbi 
alignment  due  to  the  constraint  that  the  combination  lock  phrase  being  uttered  is  known 
to  the  system. 

2.5.4 Baum-Welch Re-estimation. The problem of adapting the model
parameters to maximize $P(O|\lambda)$ can be solved in many different ways. One of the most well-known
is the Baum-Welch Method. This procedure uses $\alpha$, $\beta$, and $\gamma$ from the forward and
backward algorithms. $\gamma$ is calculated from $\alpha$ and $\beta$ using Equation 17, where it represents the
probability of being in state $i$ at time $t$ for a given observation sequence, $O$, and model $\lambda$.

The basic steps are to first determine $\xi_t(i,j)$, the probability of being in state $i$ at
time $t$ and state $j$ at time $t+1$. Taking the summation of $\xi_t(i,j)$ from $t = 1$ to $t =
T-1$ yields the expected number of transitions from state $i$ to state $j$ in $O$. A similar
summation of $\gamma_t(i)$ yields the expected number of transitions from state $i$ in $O$ [11].

$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O|\lambda)} \tag{16}$$

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O|\lambda)} \tag{17}$$

All the parameters are now in place to perform the re-estimation. The goal is to optimize
the model for the given observation sequence. Therefore, $\pi$, $A$, and $B$ must be recalculated.
The equations are given below [11]:
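In the standard formulation of Rabiner [11], for a single observation sequence and a discrete observation alphabet, the re-estimation formulas take the form sketched here. The barred symbols for the updated parameters are notation assumed for this illustration and may differ from the original presentation.

$$\bar{\pi}_i = \gamma_1(i)$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

$$\bar{b}_j(k) = \frac{\sum_{t=1,\; o_t = k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

Each ratio is an expected count divided by an expected count: transitions from state $i$ to state $j$ over all transitions out of state $i$, and visits to state $j$ while observing symbol $k$ over all visits to state $j$.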


Application of these steps results in a new HMM, $\bar{\lambda}$. By replacing $\lambda$ with $\bar{\lambda}$, this procedure
can be applied iteratively until there is a minimal difference between $\lambda$ and $\bar{\lambda}$ in successive
iterations.


2.6  Feature  Space  Trajectory 

The Feature Space Trajectory Neural Network was developed by Neiberg and Casasent
[2] for application to the multi-class pattern recognition problem. To this point, it has only
been used on images [2-7]. This research has focused on extending the techniques that
allowed FSTs to work with images so that they could work with speech.

2.6.1 What is a Trajectory? The trajectory and what it represents is the heart
of the FST. A trajectory is simply a series of interconnected points in feature space, as
in Figure 1. These points are called vertices. Since these points are connected, there is
some relationship between them. In the case of speech, each point corresponds to the
features from one frame of speech. Therefore, by ordering the points according to their
occurrence in a speech utterance, a trajectory naturally encodes the temporal aspect of
the speech.


Figure 1. Sample Trajectory in Three Dimensional Space


2.6.2  How  do  FSTs  Work?  The  first  step  in  constructing  an  FST  system  to 
perform  classification  is  to  formulate  a  database  of  trajectories  for  each  class  of  the  problem. 
To  perform  classification,  an  unknown  trajectory  is  compared  to  the  database,  and  the 
trajectory  that  is  the  smallest  distance  from  the  unknown  is  classified  as  the  winner.  It  is 
important to note that instead of comparing vertex to vertex, as would happen in a nearest
neighbor classifier, the FST compares an unknown vertex with the closest trajectory from
the  database  [6]. 

2.6.2.1 Trajectory Creation. Each trajectory is constructed of multiple
segments that are defined by a length, $l_i$, and direction, $\mathbf{v}_i$ [6]. In addition, the vector
inner products, $c_{i,i+1}$, are required for distance calculations to the test data. These values
are calculated based on the feature vectors from the training data, $\mathbf{x}_i$.

$$l_i = \|\mathbf{x}_{i+1} - \mathbf{x}_i\| \tag{20}$$

$$\mathbf{v}_i = \frac{\mathbf{x}_{i+1} - \mathbf{x}_i}{l_i} \tag{21}$$

$$c_{i,i+1} = \mathbf{x}_i \cdot \mathbf{x}_{i+1} \tag{22}$$


Figure 2. Illustration of Distance Calculation Geometry

2.6.2.2 Distance Calculations. When a test trajectory is tested against
the training data, distances are calculated from the unknown vertices, $\mathbf{p}_i$, to the closest
segment of the training trajectory. To calculate the distance, each $\mathbf{p}_i$ must first be
projected onto the known trajectory (see Figure 2).
The variable $\alpha$ denotes the position $\mathbf{p}'$ where $\mathbf{p}$ projects onto the segment with
direction $\mathbf{v}$. In the equation below, $\mathbf{u}$ is the vector from the segment start point $\mathbf{x}$ to $\mathbf{p}$.

$$\alpha = \mathbf{u} \cdot \mathbf{v} \tag{23}$$

There are three possible cases for $\alpha$ [6], where $l$ is the segment length and $\mathbf{x}_1$ and
$\mathbf{x}_2$ are the segment start and end points:

1. $\alpha$ is negative: the point does not fall on the segment. The distance is calculated
to the segment start point.

$$d = \|\mathbf{x}_1 - \mathbf{p}\| \tag{24}$$

2. $\alpha$ is positive and less than $l$: the point is on the segment and the distance is calculated
from $\mathbf{p}$ to $\mathbf{p}'$. Let $a = 1 - \alpha/l$ and $b = \alpha/l$.

$$d^2 = \mathbf{p} \cdot \mathbf{p} - 2a\,\mathbf{p} \cdot \mathbf{x}_1 - 2b\,\mathbf{p} \cdot \mathbf{x}_2 + a^2 c_{1,1} + 2ab\,c_{1,2} + b^2 c_{2,2} \tag{25}$$

3. $\alpha$ is positive and larger than $l$: the point does not fall on the segment and, since
the closest point is the endpoint, it is considered with the next segment.

For each of the vertices $\mathbf{p}_i$, the minimum distances to the training trajectory are summed
to get an overall distortion measure between trajectories [2]. The training trajectory which
has the minimal distortion value indicates class membership of the test trajectory. This
implementation allows any vertex from the test trajectory to be mapped to any segment
on the training trajectory.
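The three projection cases and the summed distortion can be sketched as below. This is an illustrative reading of the geometry rather than the original implementation: the function names are hypothetical, the perpendicular distance is computed directly instead of through the precomputed inner products, and the end-vertex case is handled in place rather than deferred to the next segment, which yields the same minimum.

```python
import numpy as np

def point_to_trajectory_distance(p, vertices):
    """Minimum distance from point p to a piecewise-linear trajectory."""
    p = np.asarray(p, dtype=float)
    vertices = np.asarray(vertices, dtype=float)
    best = np.inf
    for x1, x2 in zip(vertices[:-1], vertices[1:]):
        seg = x2 - x1
        length = np.linalg.norm(seg)                  # segment length l
        if length == 0.0:                             # degenerate segment: treat as a point
            best = min(best, np.linalg.norm(p - x1))
            continue
        v = seg / length                              # unit direction v
        alpha = np.dot(p - x1, v)                     # projection position along the segment
        if alpha <= 0.0:
            d = np.linalg.norm(p - x1)                # case 1: closest to the start vertex
        elif alpha >= length:
            d = np.linalg.norm(p - x2)                # case 3: closest to the end vertex
        else:
            d = np.linalg.norm(p - (x1 + alpha * v))  # case 2: perpendicular distance
        best = min(best, d)
    return float(best)

def trajectory_distortion(test_vertices, train_vertices):
    """Sum of minimum distances from each test vertex to the training trajectory."""
    return sum(point_to_trajectory_distance(p, train_vertices) for p in test_vertices)
```

Classification would then pick the training trajectory with the smallest distortion for a given test trajectory.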

2.6.3 Issues with regard to speech. There are two main problems that exist for
applying FSTs to speech. First, in its original form, the FST is very effective when comparing
trajectories that contain an equal number of vertices and segments. This makes the
summation of minimum distances an acceptable metric. However, if the trajectories differ in
size, the sum is no longer a valid distortion measure. Second, the FST in its present form allows
any point of the unknown trajectory to map to any segment on the training trajectory.
That is, the first point may map to the third segment and the second point may map to an
earlier segment. Since a speech signal is a temporal process, an order must be established
where vertices cannot be mapped to earlier segments. The following sections address these
issues.


2.6.3.1 Distortion Metric. The summation of minimum distances is not
valid when comparing trajectories of different lengths; however, the mean of the minimum
distances corrects this problem. The mean is a valid solution because it normalizes the
distortion metric and provides a basis for comparison among trajectories of differing lengths.
Consider the mean distance,

$$D = \frac{1}{N} \sum_{i=1}^{N} d_i \tag{26}$$

where $d_i$ represents the distance from vertex $\mathbf{p}_i$ of the test trajectory to the closest point on
the training trajectory. The distance is calculated for all $N$ vertices of the test trajectory.

2.6.3.2 Dynamic Time Warping. Dynamic Time Warping (DTW) has been
used to compare speech utterances of different lengths for many years. However, DTW is
another technique where vertices are compared to vertices. Therefore the FST must be
adapted to provide DTW where an unknown vertex is compared to the closest trajectory.
An elegant way of performing DTW ‘on-the-fly’ is by using Ney’s algorithm [12]. This is a
one-stage algorithm that has been applied to the problem of connected word recognition.
It provides the advantages of word boundary detection and nonlinear time-alignment to
enhance recognition performance. The advantage of performing this in the context of the
FST algorithm is that errors which may be introduced by missed word boundary detection
and improper time alignment are removed [12].

2.6.3.3 Template Generation. It is impossible for a person to say an
utterance exactly the same way more than once. This leads to differences in the feature space
vertices and therefore differences in trajectories. One approach for template generation
may be to construct a database of every utterance for every person in the database. This
procedure leads to a training database that is difficult to test against. For example, assume
that trajectories will be constructed for each word and that each word will be spoken ten
times in training. If we are trying to recognize a combination of just two words, the FST
would need to check each of the ten trajectories of the first word with each of the ten
trajectories of the second word, resulting in 100 comparisons. If you factor in multiple speakers,
say 10, the result is 1000 comparisons in order to determine the result. In contrast, if one
template per word per speaker could be developed, the number of comparisons drops to
10, a difference of two orders of magnitude.

The problem is how to effectively reduce multiple utterances of one word into a
trajectory that is representative of them all and still allows for discrimination from other
speakers. One method for accomplishing this could be to apply Vector Quantization (VQ)
to the training data to establish a codebook that represents the trajectory [11]. The
codebook is easily generated by standard VQ techniques; however, due to the structure
of the FST, the temporal information of the codewords must be maintained. Maintaining
temporal information in an FST classifier will be a large focus in this effort and at present
is the largest hurdle to overcome in applying FSTs to speech.

2.7 Neural Post-Processing

The optimal Bayes classifier makes use of the maximum a posteriori probability.
Baum-Welch re-estimation only provides maximum likelihoods while FSTs produce
minimum distance based decisions. Since perceptrons can provide outputs which approximate
the maximum a posteriori probability [13], their use as a post-processor will be
investigated. Benson and Bemander investigated this option in their work on speech recognition
and achieved favorable results [14]. The single-layer perceptrons will accept the outputs
from the HMMs or FSTs as inputs. Thirty-two output nodes will be used with one node
for each speaker. In the identification experiment, the output node with the maximum
value will be chosen as the identified speaker.
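A minimal sketch of such a post-processor is given below: one sigmoid output node per speaker, trained with one-hot targets, with the identified speaker taken as the node with the maximum output. The squared-error batch gradient descent, learning rate, epoch count, and all function names are assumptions made for illustration and are not taken from the systems described here.

```python
import numpy as np

def slp_forward(scores, W, b):
    """Single-layer perceptron: one sigmoid output node per speaker."""
    z = np.asarray(scores, dtype=float) @ W + b
    return 1.0 / (1.0 + np.exp(-z))

def identify(scores, W, b):
    """Index of the output node with the maximum value."""
    return int(np.argmax(slp_forward(scores, W, b)))

def train_slp(X, labels, num_speakers, lr=0.5, epochs=200):
    """Fit W and b by batch gradient descent on squared error (illustrative)."""
    X = np.asarray(X, dtype=float)
    targets = np.eye(num_speakers)[labels]         # one-hot target per training vector
    W = np.zeros((X.shape[1], num_speakers))
    b = np.zeros(num_speakers)
    for _ in range(epochs):
        y = slp_forward(X, W, b)
        grad_z = (y - targets) * y * (1.0 - y)     # error gradient at the sigmoid inputs
        W -= lr * X.T @ grad_z / len(X)
        b -= lr * grad_z.mean(axis=0)
    return W, b
```

In the setting described above, each row of `X` would hold the normalized scores that the thirty-two speaker models assign to one utterance, and `num_speakers` would be 32.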

2.8  Summary 

This chapter provided background material on the methods that will be used to
implement the speaker recognition systems. The HMM was detailed and will be used as
a baseline system for comparison against the newly developed FST system. In addition,
enhancement of the methods by adding an SLP post-processor to utilize the maximum a
posteriori probability will be examined in Chapters III and IV.

III. Approach


3.1  Introduction 

This chapter discusses the application of HMMs and FSTs to the speaker recognition
problem as well as how these methods can be compared to one another.

3.2  Hidden  Markov  Model 

In order to accomplish speaker recognition using HMMs, all of the basic blocks
defined in Chapter II must be used. The Hidden Markov Model Toolkit (HTK) is used to
accomplish this task. HTK was developed by Entropic Research Laboratory, Inc. and a
complete description can be found in [15].

a(i,j) = transition probability from state i to state j
b(k,l) = probability of observing l in state k

Figure 3. Five State Left-to-Right HMM

The HMM speaker recognition system consists of word models for each speaker.
Five-state left-to-right models are created using HTK. All of the training data for the 32 females
in YOHO are used to train the models, and the models are tested against the entire set of test
data. Each speaker has 96 combination-lock phrases of training data and 40
combination-lock phrases of test data. A test consists of running an utterance through each speaker’s
model and obtaining a log-likelihood score. This score is the basis of classification.

3.3  Speaker  Identification 

The speaker identification phase is the first step in a speaker verification system.
The models that produce the maximum Viterbi log-likelihood value identify the unknown
speaker. This corresponds to the basic Bayesian classifier (assuming equal priors) where
speaker model $i$ is chosen to represent the speaker of a given utterance $U$ [8].

$$i = \arg\max_{k} \{ \log p(U|\lambda_k) \}, \quad 1 \le k \le 32 \tag{27}$$

3.4  Speaker  Verification 

Speaker verification utilizes the results from the speaker identification phase for
selecting reference speakers. Reference speakers, also known as cohorts, are used to normalize
the log likelihood ratio of the test utterance using the claimed speaker’s model [16-18].
The log-likelihood ratio $\mathcal{L}$ is derived from the Bayes optimal decision rule for classifying
true speakers against impostors. Classification is made by comparing $\mathcal{L}$ to a threshold.
Define the log-likelihood ratio of an utterance $U$, with a reference “cohort” set size denoted
by $|C|$, as the following approximation [19],

$$\mathcal{L}(U) = \log p(U|\lambda_{claim}) - \frac{1}{|C|} \sum_{j \in C} \log p(U|\lambda_j) \tag{28}$$

where $\lambda_{claim}$ is the claimed speaker’s model and $\lambda_j$ is one model from the claimed speaker’s
cohort set. This has recently been called the Geometric Mean normalization [18].
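The cohort-normalized score and the threshold decision can be sketched as follows. The function names, the list-of-scores layout indexed by speaker, and the accept-at-or-above-threshold convention are assumptions for illustration only.

```python
import numpy as np

def cohort_normalized_score(log_likelihoods, claimed, cohort):
    """Claimed model's log-likelihood minus the mean cohort log-likelihood.

    log_likelihoods[k] is log p(U | lambda_k) for the test utterance U.
    """
    ll = np.asarray(log_likelihoods, dtype=float)
    return float(ll[claimed] - ll[list(cohort)].mean())

def verify(log_likelihoods, claimed, cohort, threshold):
    """Accept the identity claim when the normalized score reaches the threshold."""
    return cohort_normalized_score(log_likelihoods, claimed, cohort) >= threshold
```

For example, with scores of -62, -70, -66, and -68 for speakers 0 to 3 and a cohort of speakers 2 and 3, a claim of identity 0 scores 5.0 while a claim of identity 1 scores -3.0.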

3.4-1  Cohort  Selection.  The  set  of  cohorts  C  will  be  selected  as  “close”  speakers 
based  on  log-likelihood  Viterbi  scores  using  all  enrollment  data.  Cohorts  are  selected  for 
each  speaker  based  on  the  smallest  distortion  metric  using  the  three  methods  listed  below. 

1.  Difference  of  Means  [17]  or  (DOM)  sorts  by  mean  difference  of  log-likelihoods  enroll¬ 
ment  scores. 


dooMiXuXj)  =  log 


p{U\Xi) 


p{U\Xj) 

=  logp(W|A<)  -  logp(W|A,) 


(29) 


2. Reynolds’ [20] Symmetric method sorts on pairwise log-likelihood ratio enrollment
information (if speaker i is “close” to speaker j, then the reverse must be true).

$$ d_{SYM}(\lambda_i, \lambda_j) = \log \frac{p(U_i \mid \lambda_i)}{p(U_i \mid \lambda_j)} + \log \frac{p(U_j \mid \lambda_j)}{p(U_j \mid \lambda_i)} \qquad (30) $$

where $U_i$ denotes the enrollment data of speaker i.

3. Second order Bhattacharyya measure [21] sorts cohorts using the variance of the
enrollment likelihoods as well.

$$ d_B(\lambda_i, \lambda_j) = \frac{(m_i - m_j)^2}{4(\sigma_i^2 + \sigma_j^2)} + \frac{1}{2} \log \frac{\sigma_i^2 + \sigma_j^2}{2 \sigma_i \sigma_j} \qquad (31) $$

where $m_j$ represents the enrollment likelihood mean and $\sigma_j^2$ represents the enrollment
likelihood variance. Fielding [22] has shown how the use of second-order statistics
can be useful for HMM model comparisons.
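
The three distortion measures can be sketched from summary statistics of the enrollment
scores. This is a hedged illustration: the helper names are hypothetical, the symmetric
measure follows the commonly cited pairwise form, and the Bhattacharyya term uses the
standard distance between two univariate Gaussians, which is assumed here to be the form
intended by the text.

```python
import math


def d_dom(mean_ii, mean_ij):
    """Difference of means: speaker i's mean enrollment score on model i minus on model j."""
    return mean_ii - mean_ij


def d_sym(mean_ii, mean_ij, mean_jj, mean_ji):
    """Symmetric measure: sum of the two directed log-likelihood ratios."""
    return (mean_ii - mean_ij) + (mean_jj - mean_ji)


def d_bhattacharyya(m_i, var_i, m_j, var_j):
    """Univariate Gaussian Bhattacharyya distance from score means and variances."""
    mean_term = 0.25 * (m_i - m_j) ** 2 / (var_i + var_j)
    var_term = 0.5 * math.log((var_i + var_j) / (2.0 * math.sqrt(var_i * var_j)))
    return mean_term + var_term


def closest_speakers(distances, n_cohorts):
    """Return the n_cohorts speaker labels with the smallest distortion."""
    return sorted(distances, key=distances.get)[:n_cohorts]


# Hypothetical distortions from one speaker to three others.
cohorts = closest_speakers({2: 9.0, 3: 4.0, 4: 12.0}, 2)
```

Whichever measure is used, cohort selection itself is only a sort on the resulting
distortions, so the measures can be swapped without changing the rest of the system.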


3.5 HMM with Single-Layer Perceptron Post-Processor

This approach takes advantage of the discriminative power of perceptrons. HMMs
are very effective at modeling the temporal characteristics of speech; however, since they
rely on maximum likelihood estimation (MLE), they are not necessarily discriminative.
This approach employs an SLP as a post-processor to improve classification performance.
The SLP accepts as inputs the log likelihoods produced by the HMMs and selects the speaker
with the highest probability of having spoken the utterance under test.

In order for the SLP to be effective as a post-processor, the log likelihoods from the
HMMs must be normalized. In general, these values lie within a narrow range without a
large degree of separation. For the females from YOHO, the entire set of log likelihoods
ranges from -80 to -61. In terms of probabilities, these values differ by nineteen orders
of magnitude, but the log values compress this distance. In order for perceptrons
to train effectively, these values must be normalized.

A widely accepted method of normalizing SLP input data is statistical normalization
[23]. In this method, each feature j (the log likelihood from a specific speaker) is normalized
such that it has zero mean and unity variance, see Equation 32. This attempts to balance
the discriminant nature of each feature with respect to all others.

$$ \hat{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j} \qquad (32) $$

where $x_{ij}$ is feature j of sample i, and $\bar{x}_j$ and $\sigma_j$ are the mean and standard
deviation of feature j.

Although statistical normalization is the method used in this thesis, two others were
investigated. The first is energy normalization. In this technique, each element $x_{ij}$ of the
feature vector is divided by the vector magnitude such that the resulting vector is of unit
magnitude, see Equation 33. Energy normalization attempts to capture discriminant
information between elements of a single feature vector. The problem is that the relative
distance relationship with other samples is lost.

$$ a_{ij} = \frac{x_{ij}}{\lVert x_i \rVert} \qquad (33) $$

The second, complement coding, is a normalization technique which seeks a middle ground
between the two techniques mentioned above. The first step is to energy normalize the
features to obtain vectors of unit magnitude, as in Equation 33. Next, the complement
of each element is taken and appended to the vector as shown in Equations 34 and 35.
This doubles the number of features, but has the advantage of retaining relative distance
information among samples.

$$ a^c_{ij} = 1 - a_{ij} \qquad (34) $$

$$ I = [\, a \;\; a^c \,] \qquad (35) $$
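
The three normalization schemes described above can be sketched in a few lines. This is
an illustration under stated assumptions: the function names are hypothetical, and the
population standard deviation is used for the statistical case because the text does not
say which estimator was applied.

```python
import math
from statistics import mean, pstdev


def statistical_normalize(samples):
    """Give every feature (column) zero mean and unit variance across the samples."""
    columns = list(zip(*samples))
    means = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # population estimator: an assumption
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sigmas)]
        for row in samples
    ]


def energy_normalize(vector):
    """Scale one feature vector to unit Euclidean magnitude."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def complement_code(vector):
    """Energy normalize, then append the complement 1 - a of every element."""
    a = energy_normalize(vector)
    return a + [1.0 - x for x in a]


# Hypothetical log likelihoods: three utterances scored against two speaker models.
scores = [[-80.0, -61.0], [-70.0, -65.0], [-60.0, -69.0]]
z_scores = statistical_normalize(scores)
coded = complement_code([3.0, 4.0])
```

Statistical normalization works per feature across samples, while energy normalization
works per sample across features. That difference is why the latter loses the relative
distance between samples.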


3.6  Cohort  Selection  Using  Neural  Post-Processing 

Since  the  outputs  of  the  SLP  can  approximate  the  a  posteriori  probabilities  [13]  and 
cohort  selection  attempts  to  find  the  speakers  that  are  “close”  to  one  another,  it  is  possible 
to  use  the  SLP  outputs  as  criteria  for  selecting  the  cohort  speakers.  The  SLP  outputs 
are  able  to  represent  higher  order  relationships  better  than  the  raw  log  likelihoods.  Since 
the  HMM  relies  on  determining  the  maximum  likelihood,  it  can  only  capture  a  first  order 
relationship.  Therefore,  the  SLP  has  the  advantage  of  using  more  information  to  make  a 
decision. 

In order to select cohorts using the SLP, each utterance is run through the HMMs
with the resulting log likelihoods used as inputs to the SLP. Each output of the SLP is
associated with one of the speakers. These outputs of the SLP are ranked from highest
to lowest, with the highest representing the speaker with the highest probability of saying
the utterance [13], see Table 1. This process is performed on each test utterance. Once
all utterances for a particular speaker are complete, the individual rankings are summed
to get an overall ranking. This final ranking is used as the basis for cohort selection, see
Table 2.
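
The rank-sum procedure described above can be sketched as follows. The function names are
hypothetical, ties are broken by first occurrence, and the optional exclusion of the
claimed speaker from the returned list is an assumption of this sketch rather than
something the text specifies.

```python
def rank_outputs(outputs):
    """Rank one utterance's SLP outputs: rank 1 goes to the largest output value."""
    ordered = sorted(outputs, key=outputs.get, reverse=True)
    return {speaker: rank for rank, speaker in enumerate(ordered, start=1)}


def select_cohorts(utterance_outputs, n_cohorts, exclude=None):
    """Sum per-utterance ranks and return the speakers with the smallest rank sums."""
    totals = {}
    for outputs in utterance_outputs:
        for speaker, rank in rank_outputs(outputs).items():
            totals[speaker] = totals.get(speaker, 0) + rank
    ordered = [s for s in sorted(totals, key=totals.get) if s != exclude]
    return ordered[:n_cohorts], totals


# Hypothetical SLP outputs (speaker label -> output value) for three utterances.
utterances = [
    {1: 0.90, 2: 0.05, 3: 0.30},
    {1: 0.80, 2: 0.40, 3: 0.10},
    {1: 0.70, 2: 0.20, 3: 0.60},
]
cohorts, rank_sums = select_cohorts(utterances, n_cohorts=1, exclude=1)
```

Only sorting and integer addition are involved, so no means or variances need to be
estimated to build the cohort list.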


Table 1. Ranking of output node values for one utterance

Output Node    Output Value    Rank
1              0.9789          1
32             0.0679          2
29             0.0398          3
26             0.0305          4
20             0.0200          5
9              0.0094          6
4              0.0089          7
24             0.0079          8
6              0.0074          9
12             0.0065          10
8              0.0044          11
5              0.0044          12
27             0.0032          13
22             0.0030          14
7              0.0026          15
17             0.0025          16
2              0.0021          17
10             0.0018          18
3              0.0016          19
11             0.0015          20
19             0.0013          21
16             0.0007          22
23             0.0006          23
15             0.0005          24
14             0.0005          25
13             0.0004          26
30             0.0004          27
28             0.0002          28
25             0.0000          29
21             0.0000          30
28             0.0000          31
31             0.0000          32


3.7 Feature Space Trajectory

Prior to this effort, the Feature Space Trajectory (FST) Neural Network had not been
applied to speech. Therefore, development of such a system must be undertaken in gradual
steps. The first step necessary is to determine how speech signals can be transformed into
trajectories.


Table 2. Ranking of output node values for all training utterances for speaker 1

Cohorts    Sum of Ranks    Rank
1          96              1
5          762             2
24         961             3
9          965             4
32         987             5
17         1081            6
20         1094            7
22         1174            8
26         1273            9
3          1284            10
23         1309            11
15         1368            12
12         1433            13
2          1504            14
7          1593            15
29         1659            16
16         1696            17
4          1702            18
10         1773            19
8          1784            20
25         1809            21
27         1833            22
6          1845            23
18         1884            24
28         1908            25
19         2061            26
13         2171            27
21         2187            28
31         2262            29
14         2290            30
30         2293            31
11         2467            32


3.7.1 Trajectory Construction. In the original FSTs used in image recognition,
series of images are used to characterize an object and create a trajectory [2-7].
Conceptually, the trajectory is created by connecting sequential points in feature space via line
segments. Adjacent points represent images that have a temporal order. An analogy can
be drawn to speech signals in that features from consecutive frames of sampled speech can
be used to create a trajectory.

The next question is at what level of speech trajectories should be constructed. It is
possible to create trajectories that represent an entire utterance from YOHO. The problem
is that every utterance is different in that one may be “Ninety-three, Fifty-seven,
Thirty-two” while another may be “Twenty-four, Forty-six, Eighty-one.” Even if the same person
spoke both utterances, the trajectories are vastly different due to the difference in words
that comprise them. To remedy this problem, word level trajectories have been chosen.
By constructing trajectories at the word level, individual words can be concatenated to
make every possible utterance in YOHO. If one trajectory is created for each speaker per
word, the result is sixteen trajectories for each speaker. In contrast, creating trajectories
for every possible utterance would require more than 350,000 trajectories per speaker. In
addition to the abundance of storage space required, the training time for each speaker
would be a limiting factor.

The next step is determining how to test the trajectories. One solution is to use every
instance of each word as a training trajectory. For the word ‘ONE’ alone this results in 42
trajectories for each speaker in the database. When attempting to test an entire utterance,
the number of combinations resulting from concatenating multiple word instances makes
the number of test runs very large. However, using every instance is useful for development
of the basic techniques for applying FSTs to speech when limiting the testing to only one
word from an utterance.

Using this idea for a proof of concept, the initial FST for speech could be developed.
However, a second problem is produced. All instances of the word ‘ONE’ are of different
lengths, resulting in different numbers of frames and therefore different length trajectories.
The original FST for image recognition is designed to work on a single point or trajectories
of equal lengths. In order to overcome this obstacle, a method of comparing utterances of
different lengths must be developed.

3.7.2 Incorporating Ney’s Algorithm. Ney’s algorithm is an efficient way of
performing Dynamic Time Warping (DTW) in a one-step process using Dynamic
Programming [12]. By using DTW, two signals of different lengths can be compared to one
another, with one being either stretched or compressed to best match the template. The
result is a distortion metric which indicates the distance between the two signals.

In DTW, a matrix is created containing distances between
points of two signals, see Figure 5(a). In the case of speech signals, the points represent
feature space locations corresponding to individual frames from an utterance.

The idea is to find the minimum distance path through the matrix while applying
certain constraints. The constraints determine the path that can be created. Figure 5(b)
shows acceptable transitions and the weights associated with each move.

The idea is to reward transitions that keep the path on the upward diagonal while
penalizing those moves that get away from the diagonal. The legal predecessors establish
which moves are allowed and the cost of making each one. For example, a move on the
diagonal is good and will not receive a penalty (weight = 1); however, to stay in the same
frame or skip a frame ahead will receive a penalty. In Figure 5(b), staying in the same
training frame has a weight of 1.5 and skipping a training frame has a weight of 2.0. The
weights are important when determining which transition to make. Figure 5(b) shows
the detailed distance measurements for movement from frame 4 to frame 5 of the test
utterance, the area in the bold box of Figure 5(a). Without the weights, the transition
chosen would be to skip one frame due to its distance of 14, which is lower than the values
for moving one frame ahead or staying in the same frame. However, when the weights
are considered, the transition chosen is to remain in the same frame. Staying in the same
frame has a distance of 22.5, moving one frame forward has a distance of 25, and skipping
one frame has a distance of 28; thus, the minimum distance of 22.5 forces the transition to
remain in the same frame. The values shown in Figure 5(b) result from multiplying the
original distances (Figure 5(a)) by the associated weight.

Figure 5. (a) is a representative DTW Distance Matrix and (b) represents the legal moves
and their weights

The original FST finds the nearest point on the training trajectory to a test point.
This is done without regard for where the closest point lies on the training trajectory. For
speech, since temporal data is involved, it is important to consider whether a point from
the end of one trajectory maps to the beginning of another or vice versa. In the same
manner, once we have reached a given point in the speech data, we do not want to allow
the point to map to an earlier segment of the trajectory. That violates the temporal nature
of speech.

Toward this end, Ney’s algorithm is incorporated to improve the FST so as to produce
a Ney-based FST. In this version of the FST, comparison starts at the beginning of the
training trajectory and moves forward just as in the left-to-right HMM. The only moves
allowed are to stay on the same segment, move one segment ahead, or move two segments
ahead, see Figure 5(b). Staying on the same segment corresponds to a test trajectory
needing to be compressed. Moving ahead one segment is the desired result, meaning that
the test and training trajectories are proceeding at the same rate. Moving two segments
ahead means that the test trajectory needs to be expanded to provide a better match to
the training trajectory.
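
The weighted, forward-only alignment described above can be sketched as a small dynamic
program. This is a simplified illustration rather than Ney’s algorithm as published or
the thesis implementation: it compares test frames to training frames point to point
(the FST uses distances to trajectory segments), it assumes the path starts at the first
training frame and ends at the last, and the function names are hypothetical. The weights
1.5, 1.0, and 2.0 are the values quoted for Figure 5(b).

```python
import math

# Training-frame advance per test frame -> weight, as quoted for Figure 5(b).
MOVE_WEIGHTS = {0: 1.5, 1: 1.0, 2: 2.0}


def weighted_dtw(test, train, weights=MOVE_WEIGHTS):
    """Minimum weighted path cost aligning test frames to training frames.

    Each test frame consumes one step; the training index may stay, advance by
    one, or skip one frame. Returns math.inf when no legal path reaches the end.
    """
    n_train = len(train)
    prev = [math.inf] * n_train
    prev[0] = math.dist(test[0], train[0])  # assumed start at the first training frame
    for t in range(1, len(test)):
        cur = [math.inf] * n_train
        for j in range(n_train):
            local = math.dist(test[t], train[j])
            for step, weight in weights.items():
                i = j - step  # legal predecessor on the training axis
                if i >= 0 and prev[i] < math.inf:
                    cur[j] = min(cur[j], prev[i] + weight * local)
        prev = cur
    return prev[-1]  # assumed end at the last training frame


# Worked example from the text: raw distances 15, 25 and 14 for stay, advance, skip.
raw = {0: 15.0, 1: 25.0, 2: 14.0}
weighted = {step: MOVE_WEIGHTS[step] * d for step, d in raw.items()}
chosen_move = min(weighted, key=weighted.get)  # 0, i.e. stay in the same frame
```

Because the training index can never decrease, a test point cannot map back to an earlier
part of the training trajectory, which is the temporal constraint the text asks for.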

Using the proposed Ney FST, a proof of concept test is constructed where each of the
first speaker’s test instances of the word ‘ONE’ is tested against every speaker’s
training instances of the word ‘ONE.’ This shows the ability to use FSTs with speech.
However, it is unrealistic to use all training utterances, as discussed previously. For the
word ‘ONE,’ each speaker had 42 instances in the training data. Therefore, to perform speaker
identification requires testing 1344 trajectories (42 instances for each of 32 speakers). A
method for reducing the size of the training set while maintaining performance is the next
area of interest.

3.7.3 Template Generation. In order to reduce the number of training templates
in the database, it is desirable to find a way of determining one trajectory which is
representative of how a given speaker says each word of the YOHO vocabulary. This would
result in 16 templates per speaker, one for each word, as opposed to the 576 trajectories
resulting from using each instance of the word ‘ONE’. Two methods of determining the
optimum template are developed.

The first is to pick the instance which minimizes the distance to all other training
instances of the word. This process involves using the Ney FST to obtain a distance
measure from each training instance to every other training instance. The one that results
in the minimum sum of distances to all trajectories is picked as the optimum template.
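
The minimum-sum-of-distances rule described above can be sketched as follows. The
function name is hypothetical, and the distance function is passed in as a parameter so
that any trajectory distance can be used; the absolute difference in the usage example is
only a stand-in for the Ney FST distance.

```python
def select_template(instances, distance):
    """Return the index of the instance with the smallest summed distance to the others."""
    best_index, best_total = None, float("inf")
    for i, candidate in enumerate(instances):
        total = sum(
            distance(candidate, other)
            for j, other in enumerate(instances)
            if j != i
        )
        if total < best_total:
            best_index, best_total = i, total
    return best_index


# Stand-in data: scalars instead of trajectories, absolute difference as the distance.
toy_instances = [0.0, 1.0, 1.5, 2.5, 9.0]
template_index = select_template(toy_instances, lambda a, b: abs(a - b))
```

The search needs one distance evaluation for every ordered pair of instances, so its cost
grows quadratically with the number of training instances per word.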

The second is to Vector Quantize (VQ) all of the training data for a given word from
each speaker to produce a codebook. Here, VQ is performed two different ways. One
uses a Euclidean distance measure while the other uses a Mahalanobis distance measure.
These distances provide different results when creating the codebook, see Figure 6. The
Euclidean distance measure assumes the clusters are spherical. Mahalanobis distance, on
the other hand, makes no such assumption and uses the mean and variance of the data to
determine the shape of the clusters, with distances normalized relative to ellipse size and
orientation [24].
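
The difference between the two distortion measures can be illustrated with a short
sketch. It assumes a diagonal covariance (one variance per feature), which captures
ellipse size along each axis but not ellipse orientation; a full covariance matrix would
be needed for the latter. The function names are hypothetical.

```python
import math


def euclidean_distance(x, centroid):
    """Plain Euclidean distance: every feature axis is weighted equally."""
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(x, centroid)))


def mahalanobis_distance_diag(x, centroid, variances):
    """Mahalanobis distance with a diagonal covariance (assumption of this sketch)."""
    return math.sqrt(
        sum((a - c) ** 2 / v for a, c, v in zip(x, centroid, variances))
    )


# A cluster that is wide along the first axis (variance 4) and narrow along the second.
centroid, variances = (0.0, 0.0), (4.0, 1.0)
along_wide_axis = mahalanobis_distance_diag((2.0, 0.0), centroid, variances)
along_narrow_axis = mahalanobis_distance_diag((0.0, 2.0), centroid, variances)
```

Both points are a Euclidean distance of 2 from the centroid, yet the point along the
high-variance axis is treated as closer under the Mahalanobis measure.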

Codebook generation follows the basic LBG VQ algorithm [11]. However, because
the system seeks to take advantage of the temporal nature of the speech signal, a method of
recovering the temporal ordering of the codewords is necessary. By applying an utterance to
the codebook, it is possible to determine which codeword each frame of speech data mapped
to. If only one utterance is used, the solution could be obtained quickly by applying the
mapping of that utterance to the codebook. However, since multiple utterances exist in
the training data, a way of combining each of their mappings into one overall temporal
ordering is required.

Figure 6. Illustrations of the cluster shapes as a result of (a) Euclidean distance distortion
measure and (b) Mahalanobis distance distortion measure

Two methods are employed to determine the codeword ordering from the mappings
of each utterance to the codebook. When the mappings are obtained, it is known which
codeword each frame of speech is mapped to. The idea is to determine which codewords
are mapped to by frames from the beginning, the middle, and the end of the utterance. By
analyzing the mappings, it is possible to create the desired order. The first method
is to determine the mean of the frames which map to each codeword. The codeword
with the lowest mean is placed first, and the codeword with the highest mean is placed
last. In a similar manner, the median of the mapped frame values is also used. This may
remove errors in the mean that may surface due to a few unreasonably high or low values.
Once again, the codeword with the lowest median is placed first in the ordering and the
codeword with the highest median is placed last.

Therefore, based on Vector Quantization, four different methods of template
generation are examined.

1.  Euclidean  distance  with  mean  ordering. 

2.  Euclidean  distance  with  median  ordering. 

3.  Mahalanobis  distance  with  mean  ordering. 

4.  Mahalanobis  distance  with  median  ordering. 

For  comparison,  the  templates  from  each  of  the  five  template  generation  techniques  are 
tested  in  the  proof  of  concept  experiment. 
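
The mean and median ordering rules used by the four VQ-based methods can be sketched as
follows. This is an illustration under stated assumptions: the function name is
hypothetical, raw frame indices are pooled across utterances without normalizing for
utterance length, and codewords that no frame maps to are simply left out.

```python
from statistics import mean, median


def order_codewords(mappings, statistic=mean):
    """Order codewords by a statistic of the frame indices that map to them.

    mappings: one list per utterance; element t is the codeword chosen for frame t.
    """
    frames_per_codeword = {}
    for mapping in mappings:
        for frame_index, codeword in enumerate(mapping):
            frames_per_codeword.setdefault(codeword, []).append(frame_index)
    return sorted(
        frames_per_codeword,
        key=lambda cw: statistic(frames_per_codeword[cw]),
    )


# Hypothetical 21-frame utterance: codeword 0 covers frames 0-2 plus one stray
# late frame, codeword 1 covers frames 3-6, codeword 2 covers frames 7-19.
mapping = [0] * 3 + [1] * 4 + [2] * 13 + [0]
by_mean = order_codewords([mapping], statistic=mean)
by_median = order_codewords([mapping], statistic=median)
```

In this example the single stray frame pulls the mean position of codeword 0 past that of
codeword 1, while the median keeps codeword 0 first, which is the robustness argument
made for the median ordering.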

3.8  FST  with  SLP  Post-Processor 

The post-processor used with the FST is the same configuration as that used with
the HMM. A single-layer perceptron is used that accepts 32 inputs and provides 32
outputs. The 32 inputs represent the distance scores from the test trajectory to the training
trajectory for each speaker. This facilitates comparison to determine the performance
enhancement provided.

3.9  Summary 

This chapter defined two systems for accomplishing speaker recognition. Multiple
cohort selection techniques were introduced, including the newly developed SLP technique.
This technique takes advantage of the probabilistic properties of perceptron outputs to
provide an improved basis for cohort selection. A Ney-based FST was introduced which
capitalizes on the temporal structure of trajectories by incorporating the time alignment
techniques of Ney’s Dynamic Time Warping Algorithm. In addition, a manner for
combining both the HMM and FST with an SLP post-processor was described. The SLP
post-processor is used to improve the identification and verification performance of these
systems.

IV. Results


4.1  Introduction 

This chapter provides results of the speaker recognition techniques described in
Chapters II and III. The first technique is the HMM applied to speaker identification. The second
technique expands upon the first by using the speaker identification results from the HMM
to perform speaker verification. The third technique demonstrates the FST in performing
speaker identification. The chapter concludes with a direct comparison of the HMM and
FST.

4.2  Hidden  Markov  Model  Speaker  Identification 

In order to perform speaker identification, word level HMMs are constructed for each
speaker. A test phrase is obtained from the YOHO database and the corresponding word
models are arranged to set up for forced Viterbi alignment. The resulting scores from each
speaker’s model are then used to make the identification. The model with the highest
Viterbi score is chosen as the identified speaker.

For baseline purposes, speaker level HMMs are also constructed. Similarly, these
models consist of 5 states but are ergodic, see Figure 7. The key difference is that there
is only 1 model per speaker; but with word level HMMs, there are 16 models per speaker
to represent each of the 16 words from the YOHO vocabulary. Using a single model per
speaker results in a large reduction in the number of free parameters. Testing is conducted
in the same manner as with word level HMMs.

a(i,j) = transition probability from state i to state j
b(k,l) = probability of observing l in state k

Figure 7. Five State Ergodic HMM - Not all connections shown


Table  3.  Closed-set  speaker  identification  Rates(%)  for  1,  2,  and  4  combination  lock 
phrases  using  Equation  27. 


Speaker level models are very simple representations of the speakers in the database.
Using only 5 states, they attempt to model every test utterance in YOHO. This leads to
low system performance as shown in the first line of Table 3. Using one combination-lock
phrase, the speaker model identification rate was 70.23%. Although the rate increased to
80.16% and 86.25% when two and four combination-lock phrases were used, there is still
more than 13% error in the system that can be eliminated.

To improve performance, one possible solution is to divide the larger problem into a
series of smaller problems. This division is represented by creating HMMs at the word level.
In this manner, there are 5 states used to represent each word. Word level HMMs use a total
of 80 states among the required 16 models compared to the 5 states of the speaker models.
By providing more states, the performance of the system shows a substantial improvement
in Table 3. There is a tradeoff that occurs between desired level of performance and
number of free parameters. The word level models perform better but also require more
calculations due to the increased number of free parameters.

4.2.1  Neural  Post-Processor.  One  way  of  enhancing  the  performance  of  simple 
HMM  systems  is  to  add  neural  network  post-processing  to  the  back-end  of  the  system. 
In  this  thesis,  single-layer  perceptrons  (SLP)  are  used.  They  are  configured  to  accept  32 
inputs  and  provide  32  outputs.  The  32  inputs  correspond  to  the  Viterbi  score  provided  by 
each  of  the  HMMs. 
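
The forward pass of such a post-processor can be sketched as below. The text fixes only
the 32 inputs and 32 outputs; the sigmoid activation, the function names, and the toy
weights are assumptions made for illustration, and training of the weights is not shown.

```python
import math


def slp_forward(inputs, weights, biases):
    """One layer: each output node applies a sigmoid to a weighted sum of all inputs."""
    outputs = []
    for row, bias in zip(weights, biases):
        activation = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-activation)))  # sigmoid: an assumption
    return outputs


def pick_speaker(outputs):
    """Return the 1-based index of the output node with the largest value."""
    return max(range(len(outputs)), key=lambda k: outputs[k]) + 1


# Toy two-speaker example with normalized scores as inputs (hypothetical values).
toy_outputs = slp_forward([0.5, -0.5], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
toy_choice = pick_speaker(toy_outputs)
```

In the configuration described in the text, the weight matrix would be 32 by 32, with one
row per output node and one column per normalized Viterbi score.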

Table 4. Closed-set speaker identification rates (%) for 1, 2, and 4 combination lock
phrases using SLP Post-processor.

Method            1        2        4
Speaker Models    88.44    94.22    96.88
Word Models       93.52    97.97    99.06

The SLP provides improved speaker identification in all cases, see Table 4. These
results were obtained using statistical normalization. All normalization techniques discussed
in Chapter III were examined, but statistical normalization outperformed them all.

An interesting note is that speaker level models with SLP post-processing perform
as well as or better than word level models alone. The simple speaker representation
results in lower template storage requirements and in the ability for the system to process
the information more rapidly. The addition of the SLP post-processor does add some
overhead, but this is primarily during training. When testing, the results from the HMMs are
fed directly to the SLP and the calculations are basically instantaneous.

4.3  Hidden  Markov  Model  Speaker  Verification 

The same HMMs used in the speaker identification experiment are used for speaker
verification. The first part of the process is identical to the speaker identification described
above. The difference is that during training, the training scores are used to create the
“cohort” sets. These cohort sets represent the N speakers that are closest to a given
speaker. That is, the system is likely to confuse one of the cohorts as the true speaker. By
identifying these cohorts, we are able to remove them from the testing and provide better
results as is standard in the speaker recognition community. Three methods of cohort
selection are implemented with respect to the HMM outputs:

1. Difference of Means (d_DOM)

2. Bhattacharyya (d_B)

3. Symmetric (d_SYM)

The same methods are applied to word level and speaker level HMMs with the results
shown in Table 5. Similar to the results from speaker identification, speaker verification
performance improves through the use of word models. In all cases, the Equal Error Rate
(EER) is reduced by using the word level models. The type of model does not affect the
relationship of the results among the cohort selection techniques. For example, in Table 5
Bhattacharyya provided the lowest EER for five cohorts and one combination-lock phrase.
This is true for both word and speaker models.

Table 5. Equal Error Rates for Speaker Verification using HMM

                   Speaker Models (Word Models)
Method    |C|      1                2                4
d_DOM     1        19.92 (16.09)    17.50 (11.09)    15.94 (6.25)
          2        15.55 (11.95)    12.97 (8.12)     10.62 (5.31)
          3        12.73 (10.46)    10.16 (6.25)     8.12 (4.69)
          4        11.88 (9.90)     8.89 (5.91)      7.83 (4.06)
          5        11.42 (9.77)     8.90 (5.77)      7.19 (3.47)
d_B       1        21.56 (16.48)    19.53 (10.97)    17.80 (6.25)
          2        15.16 (12.97)    12.50 (8.47)     11.22 (5.00)
          3        14.37 (11.10)    11.43 (7.19)     9.63 (4.69)
          4        11.88 (10.23)    8.93 (5.78)      8.44 (3.75)
          5        11.17 (9.52)     8.27 (5.65)      6.92 (3.38)
d_SYM     1        22.34 (15.32)    20.64 (11.41)    18.73 (7.45)
          2        17.19 (12.81)    14.81 (8.26)     13.12 (5.37)
          3        15.30 (11.80)    12.66 (7.22)     10.62 (4.31)
          4        12.59 (10.63)    9.84 (6.56)      8.75 (4.38)
          5        11.95 (9.77)     9.34 (6.54)      7.23 (4.38)


4.3.1 Cohort Selection Using Perceptron Outputs. The neural post-processor
provides another means of determining the cohort speakers. Under certain conditions,
outputs of SLPs approximate a posteriori probabilities and therefore provide an enhanced
basis for classification [13]. The three methods of cohort selection applied to the
stand-alone HMM systems are again employed. In addition, a new method of cohort selection
based on the SLP outputs is used. The new method ranks the values of the output layer
to provide a cohort list for each utterance. It then combines all test utterance rankings for
each speaker to form an overall cohort set.


The results in Table 6 show that for small cohort set sizes, speaker verification performance
is degraded by the SLP outputs. However, as the cohort set increases in size, the
performance improves past that of the systems without the SLP for the Difference of Means and
Symmetric methods. The Bhattacharyya method experiences a reduction in performance.
The new SLP cohort selection method provides the best EER in 63.3% of the cases
examined. This is significant not only in obtaining an improvement in EER, but also in the fact
that cohort selection is made much simpler. Instead of relying on calculations of means
and/or variances, it is simply a series of rankings which can be computed quickly.

Table 6. SLP Post-Processor Speaker Verification Equal Error Rates

                   Speaker Models (Word Models)
Method    |C|      1                2                4
d_DOM     1        20.41 (18.20)    17.37 (13.15)    15.24 (9.12)
          2        15.78 (13.68)    13.27 (9.04)     10.63 (5.31)
          3        13.13 (10.23)    10.32 (5.94)     9.06 (3.39)
          4        10.70 (9.05)     8.13 (5.16)      6.52 (3.12)
          5        9.84 (7.34)      7.19 (4.22)      5.94 (1.88)
d_B       1        27.56 (19.92)    25.61 (15.16)    23.48 (11.28)
          2        20.15 (14.61)    17.54 (9.86)     15.37 (5.70)
          3        17.02 (10.72)    14.72 (6.41)     11.94 (3.07)
          4        14.46 (9.53)     11.91 (5.78)     10.60 (2.57)
          5        11.56 (7.89)     9.67 (4.53)      8.12 (1.59)
d_SYM     1        21.87 (19.84)    18.92 (15.30)    17.43 (10.62)
          2        17.73 (14.77)    15.19 (9.53)     12.81 (5.31)
          3        13.91 (10.94)    11.09 (6.70)     9.96 (4.06)
          4        11.58 (9.06)     8.94 (5.03)      7.52 (2.19)
          5        9.84 (7.97)      7.46 (4.36)      6.56 (1.56)
d_SLP     1        19.46 (18.58)    16.10 (13.91)    14.39 (9.98)
          2        15.14 (13.58)    12.50 (8.63)     10.25 (4.98)
          3        12.67 (10.00)    10.03 (5.47)     9.05 (3.15)
          4        11.39 (8.66)     8.75 (5.00)      7.20 (3.10)
          5        9.84 (7.50)      7.31 (4.19)      5.62 (1.88)


4.4  HMM/FST  Compamtive  Test 

In order to compare HMMs with FSTs, an HMM speaker identification system is
developed that uses five-state, left-to-right word models. This type of HMM structure is
closely related to the trajectories used in the FST because both force movement from the
beginning to the end. In the HMM, the model must start in the first state and can either
stay in the same state or transition to the next. The FST behaves similarly: when
comparing trajectories, the mapping may stay at the same segment, move one segment
ahead, or move two segments ahead. The HMM was limited to moving only one state at a
time because, with only five states, a two-state skip represents a large movement through
the HMM structure. To facilitate comparison with the FST systems developed, the test data
consist only of utterances of the word ‘ONE’ from each of thirty-two female speakers. The
HMM system obtained a speaker identification rate of 9.90%. This rate is extremely
low, but it is roughly three times better than chance (1/32 = 3.13%). This rate will be
compared against the FST speaker identification systems.
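
As a rough illustration of the no-skip, left-to-right topology described above, the following Python sketch builds a five-state transition matrix in which each state can only repeat or advance by one. The function name and the 0.5 self-transition probability are placeholders chosen for illustration; the trained transition probabilities of the thesis models are not given here. The last lines reproduce the chance-level comparison.

```python
import numpy as np

def left_to_right_transitions(n_states=5, stay_prob=0.5):
    """Transition matrix for a left-to-right HMM with no state skipping.

    stay_prob is a hypothetical placeholder, not a value from the thesis.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay_prob            # remain in the same state
        A[i, i + 1] = 1.0 - stay_prob  # advance exactly one state
    A[-1, -1] = 1.0                    # the final state can only repeat
    return A

A = left_to_right_transitions()

# Chance level for 32 speakers and the reported HMM rate relative to it.
chance = 1.0 / 32
ratio = 0.0990 / chance
print(A)
print(f"chance = {chance:.2%}, HMM rate / chance = {ratio:.2f}")
```

Every entry more than one step above the diagonal is zero, which is the "no two-state skip" restriction; the FST mapping, by contrast, is allowed that extra step.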

4.5  Feature  Space  Trajectory  Speaker  Identification 

The first test uses all instances of the word ‘ONE’ from the training data as individual
trajectories. Each speaker has forty-two such instances. Speaker identification is performed
by comparing each instance of the word ‘ONE’ from the test data against each of the
forty-two training trajectories from all thirty-two speakers. This results in 1344 comparisons
(42 x 32) for each test utterance. This method performs speaker identification at the rate
of 65.52% correct.
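
The exhaustive search described above amounts to a nearest-trajectory rule. The Python sketch below shows that rule on synthetic data. The distance function is only a stand-in (mean distance from each test frame to the nearest template vertex), not the Ney-based FST distance, and all names and the random data are assumptions made for illustration.

```python
import numpy as np

def placeholder_distance(test, template):
    """Stand-in for the Ney-based FST distance (an assumption, not the thesis
    metric): mean Euclidean distance from each test frame to its nearest
    template vertex."""
    diffs = test[:, None, :] - template[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1).mean())

def identify(test, templates, distance=placeholder_distance):
    """Return (speaker, comparisons) for the closest training trajectory."""
    best_speaker, best_dist, comparisons = None, np.inf, 0
    for speaker, trajectories in templates.items():
        for trajectory in trajectories:
            comparisons += 1
            dist = distance(test, trajectory)
            if dist < best_dist:
                best_speaker, best_dist = speaker, dist
    return best_speaker, comparisons

# Synthetic data: 32 speakers, 42 trajectories each, 6 frames of 4 features.
rng = np.random.default_rng(0)
templates = {
    spk: [10.0 * spk + rng.normal(size=(6, 4)) for _ in range(42)]
    for spk in range(32)
}
test_utterance = 10.0 * 7 + rng.normal(size=(6, 4))

speaker, comparisons = identify(test_utterance, templates)
print(speaker, comparisons)
```

The comparison count grows with the size of the training database, which is the cost the single-template methods below try to avoid.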

The second test uses a single utterance from each speaker’s training data as the
template. This instance is chosen by calculating the cumulative Ney-based FST distance
to all other training instances and identifying the minimum. This method performs best
of all single-template techniques; see Table 7.
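
Choosing the instance with the smallest cumulative distance to all others is a medoid selection. A minimal Python sketch follows, assuming the pairwise distances have already been computed and stored so that entry (i, j) is the distance between instance i used as the template and instance j; that layout, the function name, and the toy data are illustrative assumptions.

```python
import numpy as np

def select_template(pairwise):
    """Index of the instance with minimum cumulative distance to all others."""
    pairwise = np.asarray(pairwise, dtype=float)
    totals = pairwise.sum(axis=1) - np.diag(pairwise)  # exclude self-distance
    return int(np.argmin(totals))

# Toy example: five one-dimensional "instances" with absolute-difference distances.
points = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
pairwise = np.abs(points[:, None] - points[None, :])
best = select_template(pairwise)
print(best, points[best])
```

In the toy data the outlier at 10 pulls the mean away from the bulk of the instances, while the selected instance stays in the middle of them.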

The other FST speaker identification tests use all of the training data from a speaker
to establish a sixteen-word codebook. This codebook attempts to represent all training
data in the form of one sixteen-vertex trajectory. The advantage of this technique is that
the number of comparisons per test utterance is reduced from 1344 to 32, a reduction of
more than one order of magnitude. The four methods described in Chapter IV are used.
The results are shown in Table 7.


Table 7.  Speaker identification accuracy for various single template generation techniques

Method                       Stand Alone (%)    SLP Post-Processor (%)
Ney-FST Min Distance Traj         33.71                 53.90
Mahalanobis/Median                25.71                 54.48
Mahalanobis/Mean                  24.95                 49.71
Euclidean/Mean                    25.14                 37.33
Euclidean/Median                  21.12                 35.81

4.5.1  FST Speaker Identification using Neural Post-Processing.  A single-layer
perceptron (SLP) is implemented as a post-processor to enhance the performance of the FST
speaker identification system. The SLP accepts the 32 distance measures as inputs and
provides 32 outputs. The output node with the maximum value is chosen as the node that
represents the identified speaker. The results are shown in the last column of Table 7.
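
The post-processing step can be sketched as a single linear layer followed by an argmax. The Python example below is only an illustration under stated assumptions: the softmax/cross-entropy training rule, the learning rate, the input standardization, and the synthetic four-speaker data are all choices made here, since this section does not specify how the SLP was trained.

```python
import numpy as np

def train_slp(X, labels, n_classes, lr=0.5, epochs=300):
    """Train a single-layer perceptron with softmax outputs by gradient descent.

    The training rule is an assumption for illustration, not the thesis method.
    """
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]                     # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        grad = (P - Y) / n
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def identify(X, W, b):
    """Pick the output node with the maximum value for each input vector."""
    return np.argmax(X @ W + b, axis=1)

# Synthetic distance vectors: the true speaker's distance tends to be smaller.
rng = np.random.default_rng(0)
n_speakers = 4
labels = rng.integers(0, n_speakers, size=200)
X = rng.normal(5.0, 1.0, size=(200, n_speakers))
X[np.arange(200), labels] -= 3.0
X = (X - X.mean(axis=0)) / X.std(axis=0)              # standardize inputs

W, b = train_slp(X, labels, n_speakers)
accuracy = float((identify(X, W, b) == labels).mean())
print(f"training accuracy: {accuracy:.2f}")
```

In the thesis setting the input and output dimensions would both be 32, one per enrolled speaker.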

These results are much improved over the stand-alone FST system and once again
demonstrate the enhancement provided by neural post-processing. In all cases, the
perceptron post-processor improved the identification rate. The Mahalanobis-based
distance templates showed the largest improvement, with their scores roughly doubling. The
template created by determining the training trajectory that minimizes the distance to all
other training trajectories showed the next best improvement, just over 20 percentage
points. The Euclidean-based templates showed the smallest improvement (12 to 15
percentage points). These results show that the Mahalanobis-based templates provide greater
discriminative ability. Such templates produce distance measures that the SLP can exploit
more effectively than those of the Euclidean-based templates. In other words, as far as the
SLP is concerned, the Mahalanobis-based distance metrics are better features than the
distance metrics produced by the other templates.
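
The size of each improvement can be checked directly from the Table 7 figures. The short Python sketch below computes the gain in percentage points and the ratio of post-processed to stand-alone accuracy; only the numbers come from the table, and the helper name is illustrative.

```python
# (stand-alone %, with SLP post-processor %) from Table 7
table7 = {
    "Ney-FST Min Distance Traj": (33.71, 53.90),
    "Mahalanobis/Median": (25.71, 54.48),
    "Mahalanobis/Mean": (24.95, 49.71),
    "Euclidean/Mean": (25.14, 37.33),
    "Euclidean/Median": (21.12, 35.81),
}

def improvement(stand_alone, with_slp):
    """Return (gain in percentage points, ratio), each rounded to 2 places."""
    return round(with_slp - stand_alone, 2), round(with_slp / stand_alone, 2)

gains = {method: improvement(*rates) for method, rates in table7.items()}
for method, (points, ratio) in gains.items():
    print(f"{method:28s} +{points:5.2f} points   x{ratio:.2f}")
```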

4.6  Conclusions 

This chapter has provided the results of the various tests performed during this
effort. A new Ney-based FST speaker identification system was developed. Its performance
is poor when compared with other speaker identification systems; however, this system
operated under a severe constraint: it uses only single-word test utterances. A
comparable HMM system using only single-word test data was developed and performed
much worse than the FST. This result is important because, although performance is not
at the desired level, the FST outperforms an established speaker identification method with
only small amounts of data.
V.  Conclusions  and  Recommendations


5.1  Introduction 

This chapter provides the conclusions and recommendations based on the results
detailed in the previous chapter. Discussion of the HMM with SLP post-processing is given
first. Next, the results of applying the FST to speaker recognition are discussed. The
third section analyzes the comparative test between HMMs and FSTs. Finally,
recommendations for follow-on research are provided.

5.2  Hidden  Markov  Models  with  SLP  Post-Processing

SLP post-processing provides a definite enhancement of speaker recognition
performance for the HMM-based systems developed. These results were consistent for both
ergodic and left-to-right HMM systems. The speaker-based model accuracy went from
86.25% to 96.88% for speaker identification, and equal error rates dropped from 6.92% to
5.62%. The word-based model accuracy went from 96.88% to 99.06% for identification, and
equal error rates dropped from 3.38% to 1.56%. These results show the utility of
SLP post-processing of simple HMM systems. It permits simpler HMMs with fewer free
parameters to obtain performance similar to that of more complex HMM systems.

5.3  Feature  Space  Trajectories 

This research has produced the first speaker recognition system using FSTs. Thus,
the feasibility of applying FSTs to speaker recognition has been proven. They can be
implemented in a stand-alone manner or incorporated with neural post-processing. The
key issue regarding the use of FSTs in speech was found to be template generation. The
goal will be to keep the level of performance but reduce the size of the training database.
The ultimate solution would be to have one template per word for every speaker. New
systems were developed toward this goal and showed favorable results because they were
able to double the performance of a similar HMM-based system. These were the first such
FST systems ever developed and have laid valuable groundwork for further research in this
area.

5.4  Hidden  Markov  Models  vs  Feature  Space  Trajectories 

In order to make a direct comparison between HMMs and FSTs and their application
to speaker recognition, a test was devised based on the limited performance of the FST. The
HMM system used five-state left-to-right word models and was tested using all test instances
of the word ‘ONE.’ The HMM speaker identification rate was 9.90%. All of the FST speaker
identification methods more than doubled this result, with the best method obtaining a
rate of 65.52%. This shows the ability of the FST to outperform the HMM on limited
amounts of test data and justifies the need for further research.

5.5  Recommendations 

Application of FSTs to speaker recognition had not been accomplished prior to this
effort. This thesis provides a solid foundation to show that FSTs offer a promising area for
research. The primary concern for follow-on research should be template generation. The
test that used each individual training utterance provided outstanding results
that exceed HMM performance. This suggests that an effective method of reducing the
number of templates could also outperform a similar HMM.

This research focused on a feature set consisting of Mel-Frequency Cepstral
Coefficients (MFCC). These features provide good performance in HMM-based systems, but a
detailed investigation into the proper feature set for FSTs may prove valuable.
Appendix  A.  Feature  Space  Trajectory  Code


A.1  FSTNN  Code

The FSTNN was implemented using code developed in Matlab. The training code
was developed by Gary Brandstrom during his thesis and was not altered [6]. Brandstrom’s
test code was used as a starting point for the incorporation of Ney’s algorithm and was
used with minor modifications.


% -----------------------------------------------------------------
% Filename: fst_tstNEY.m
% Developed by minimal changes to code developed
% by Gary Brandstrom (fst_tst.m). Changes developed
% by Eric Zeek and Neal Bruegger
%
% ----- variables in -----
%
% c   = Vector Inner Products
%
% v   = matrix where each row represents a segment between
%       points on a trajectory
%
% len = length of segment between points on a trajectory
%
% T   = a matrix where each row represents a feature vector
%       from a known image sequence.
%
% P   = a matrix where each row represents a feature vector
%       from a unknown image sequence.
%
% ----- variables out -----
%
% D   = vector representing distance from each point of P to
%       the trajectory
%
% Pp  = nearest point on trajectory where each point P projects
%
% S   = vector representing sequence of line segment projections
% -----------------------------------------------------------------

function [S,D,Pp,d] = fst_tstNEY(c,v,len,T,P)

[s,dim] = size(T);        % s = number of vertices in training trajectory,
                          % dim = length of feature vector
ss = size(P,1);           % ss = number of vertices in test trajectory
Pptemp = zeros(s,dim);    % initialize
ind = 1;

for n=1:ss,               % loop once for each test point
    u = T(1,:)-P(n,:);
    d(n,1) = sqrt(u*u');
    Pptemp(1,:) = T(1,:);     % Pptemp is P prime

    % now find distance of nth test point to each segment
    ef = T*P';                % e is (i,n) and f is (i+1,n)

    for i=1:s-1;              % for each segment
        u = P(n,:)-T(i,:);
        alpha = u*v(i,:)';

        if alpha<=0,          % closest point is start of segment
            temp = P(n,:)-T(i,:);
            d(n,i+1) = sqrt(temp*temp');
            Pptemp(i+1,:) = T(i,:);

        elseif alpha>len(i),  % closest point is next segment
            temp = P(n,:)-T(i+1,:);
            d(n,i+1) = sqrt(temp*temp');
            Pptemp(i+1,:) = T(i+1,:);

        else,                 % closest point is on segment
            a = 1-alpha/len(i);
            b = alpha/len(i);
            d(n,i+1) = P(n,:)*P(n,:)' - 2*a*ef(i,n) - 2*b*ef(i+1,n) ...
                       + 2*a*b*c(i,i+1) + a*a*c(i,i) + b*b*c(i+1,i+1);
            d(n,i+1) = sqrt(d(n,i+1));
            Pptemp(i+1,:) = a*T(i,:)+b*T(i+1,:);
        end
    end

    % Addition of weights to movement in distance matrix
    if ind < s-1,
        d(n,ind+2) = 2*d(n,ind+2);
        d(n,ind) = 1.5*d(n,ind);
        [D(n),ind2] = min(d(n,ind:ind+2));
    else
        d(n,ind) = 1.5*d(n,ind);
        [D(n),ind2] = min(d(n,ind:s));
    end;

    ind = ind + ind2 - 1;
    Pp(n,:) = Pptemp(ind,:);  % point on traj corresp to min dist
    S(n) = ind;               % segment where Pp projects
end
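
For readers who want to check the geometry used in fst_tstNEY, the Python sketch below reproduces only the three-way projection test (before the start of the segment, past its end, or on it) for a single point and a single segment. It assumes that each row of v is a unit vector along its segment, which is how alpha is compared with len(i) in the listing; the function name is hypothetical, and the Ney-style movement weights (1.5 and 2) are deliberately left out.

```python
import numpy as np

def project_onto_segment(p, t0, t1):
    """Distance from point p to the segment t0-t1 and the nearest point on it."""
    seg = t1 - t0
    length = float(np.linalg.norm(seg))
    if length == 0.0:                       # degenerate segment: a single vertex
        return float(np.linalg.norm(p - t0)), t0.copy()
    alpha = float((p - t0) @ (seg / length))  # signed length along the segment
    if alpha <= 0:                          # closest point is the segment start
        nearest = t0
    elif alpha > length:                    # closest point is the segment end
        nearest = t1
    else:                                   # closest point lies on the segment
        b = alpha / length
        nearest = (1.0 - b) * t0 + b * t1
    return float(np.linalg.norm(p - nearest)), nearest

t0, t1 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
d_on, p_on = project_onto_segment(np.array([1.0, 1.0]), t0, t1)
d_start, p_start = project_onto_segment(np.array([-1.0, 0.0]), t0, t1)
d_end, p_end = project_onto_segment(np.array([3.0, 4.0]), t0, t1)
print(d_on, p_on, d_start, p_start, d_end, p_end)
```

The Matlab listing obtains the on-segment distance from precomputed inner products (c and ef) rather than forming the nearest point first; the two routes should give the same value.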




A.2  FST  Testing  Code


% -----------------------------------------------------------------
% Filename: zfstnnNEY.m
%
% Tests all test data against every training trajectory
% for all speakers. Results are placed in confmatNEY.mat
% -----------------------------------------------------------------

% Set up array of words in the vocabulary
format = '%s';
fid = fopen('/home/bachl/ezeek/YOHO/Scripts/yohowords.list','r');
for i = 1:16,
    eval(['word' int2str(i) ' = fscanf(fid,format,1);'])
end;

% Initialize necessary data structures
load /home/bachl/ezeek/YOHO/SID/females.list
results = zeros(4,32);
Dist = zeros(100,32);
count = 0;

for j = 1:32,                % which speakers to use in verification test
    for k = 1:1,             % which words to verify against
        eval(['cd /home/bachl/ezeek/YOHO/verify/' int2str(females(j)) '/word'])
        eval(['word = word' int2str(k) ';'])
        eval(['load ' word '.num;'])
        eval(['frames = ' word ';'])
        eval(['load ' word '.dat;'])
        eval(['raw = ' word ';'])
        nSeq = size(frames,1);
        cum = cumsum(frames);

        for l = 1:nSeq,      % number of word instances
            for p = 1:4,
                data(p,:) = raw(l+((p-1)*nSeq),:);
            end;

            if l > 1,
                eframe = (nSeq*4+1)+cum(l-1)-(4*(l-1));
            else
                eframe = (nSeq*4+1);
            end;

            sframe = eframe + frames(l) - 4;
            for q = 5:frames(l),
                sframe = sframe - 1;
                data(q,:) = raw(sframe,:);
            end;

            P = data(1:frames(l),:);

            % Now test this utterance against all training utterances
            count = count + 1;
            for m = 1:32,
                % Only test word of test utterance
                eval(['cd /home/bachl/ezeek/YOHO/enroll/' ...
                      int2str(females(m)) '/word'])
                eval(['load ' word '.num;'])
                eval(['tframes = ' word ';'])
                ntSeq = size(tframes,1);
                cd /home/bachl/ezeek/YOHO/FST/Training
                best = Inf;
                for o = 1:ntSeq,
                    eval(['load ' int2str(females(m)) word int2str(o) ])
                    [S,D,Pp,d] = fst_tstNEY(c,v,len,T,P);
                    temp = mean(D);
                    if temp < best
                        Dist(count,m) = temp;
                        best = temp;
                    end;
                end;
            end;
        end;

        raw = [];
    end;
end;

save confmatNEY results Dist count;
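
The script above stores one row of distances per test utterance but does not show how an identification rate is obtained from them. One plausible scoring step, sketched in Python, is to pick the speaker with the smallest distance in each row and compare it with the true speaker. The function name, the toy matrix, and the true-speaker labels are assumptions for illustration; the thesis scoring code is not part of this listing.

```python
import numpy as np

def identification_rate(dist, true_speakers):
    """Fraction of rows whose minimum-distance column matches the true speaker."""
    dist = np.asarray(dist, dtype=float)
    predicted = dist.argmin(axis=1)
    return float((predicted == np.asarray(true_speakers)).mean())

# Toy distance matrix: 4 test utterances (rows) against 3 speakers (columns).
dist = np.array([[0.2, 0.9, 0.7],
                 [0.8, 0.3, 0.6],
                 [0.5, 0.4, 0.9],
                 [0.9, 0.8, 0.1]])
true_speakers = [0, 1, 2, 2]
rate = identification_rate(dist, true_speakers)
print(rate)
```

Because Dist is preallocated with more rows than are filled, only the first count rows would be scored in practice.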


% -----------------------------------------------------------------
% Filename: zfstnnNEYbest.m
%
% Tests all test data against each speaker's
% "best" template. Results placed in BESTall.mat
% -----------------------------------------------------------------

format = '%s';
fid = fopen('/home/bachl/ezeek/YOHO/Scripts/yohowords.list','r');
for i = 1:16,
    eval(['word' int2str(i) ' = fscanf(fid,format,1);'])
end;

load /home/bachl/ezeek/YOHO/SID/females.list
Dist = zeros(1000,32);
count = 0;

for j = 1:32,                % which speakers to use in verification test
    for k = 1:1,             % which words to verify against
        eval(['cd /home/bachl/ezeek/YOHO/verify/' int2str(females(j)) '/word'])
        eval(['word = word' int2str(k) ';'])
        eval(['load ' word '.num;'])
        eval(['frames = ' word ';'])
        eval(['load ' word '.dat;'])
        eval(['raw = ' word ';'])
        nSeq = size(frames,1);
        cum = cumsum(frames);

        for l = 1:nSeq,      % number of word instances
            for p = 1:4,
                data(p,:) = raw(l+((p-1)*nSeq),:);
            end;

            if l > 1,
                eframe = (nSeq*4+1)+cum(l-1)-(4*(l-1));
            else
                eframe = (nSeq*4+1);
            end;

            sframe = eframe + frames(l) - 4;
            for q = 5:frames(l),
                sframe = sframe - 1;
                data(q,:) = raw(sframe,:);
            end;

            P = data(1:frames(l),:);

            % Now test this utterance against all training utterances
            count = count + 1;

            cd /home/bachl/ezeek/YOHO/FST/Training
            for m = 1:32,
                % Only test word of test utterance
                eval(['load ' int2str(females(m)) word ])
                [S,D,Pp,d] = fst_tstNEY(c,v,len,T,P);
                Dist(count,m) = mean(D);
            end;
        end;

        raw = [];
    end;
end;

save BESTall Dist count;


% -----------------------------------------------------------------
% Filename: zfstnnNEYmah1.m
% Tests all test data against each speaker's
% Mahalanobis distance median ordered template.
% Results placed in mahMEDall.mat
%
% -----------------------------------------------------------------

% Get test utterances
format = '%s';
fid = fopen('/home/bachl/ezeek/YOHO/Scripts/yohowords.list','r');
for i = 1:16,
    eval(['word' int2str(i) ' = fscanf(fid,format,1);'])
end;

load /home/bachl/ezeek/YOHO/SID/females.list
Dist = zeros(1000,32);
count = 0;

for j = 1:32,                % which speakers to use in verification test
    for k = 1:1,             % which words to verify against
        eval(['cd /home/bachl/ezeek/YOHO/verify/' int2str(females(j)) '/word'])
        eval(['word = word' int2str(k) ';'])
        eval(['load ' word '.num;'])
        eval(['frames = ' word ';'])
        eval(['load ' word '.dat;'])
        eval(['raw = ' word ';'])
        nSeq = size(frames,1);
        cum = cumsum(frames);

        for l = 1:nSeq,      % number of word instances
            for p = 1:4,
                data(p,:) = raw(l+((p-1)*nSeq),:);
            end;

            if l > 1,
                eframe = (nSeq*4+1)+cum(l-1)-(4*(l-1));
            else
                eframe = (nSeq*4+1);
            end;

            sframe = eframe + frames(l) - 4;
            for q = 5:frames(l),
                sframe = sframe - 1;
                data(q,:) = raw(sframe,:);
            end;

            P = data(1:frames(l),:);

            % Now test this utterance against all training utterances
            count = count + 1;

            cd /home/bachl/ezeek/YOHO/FST/Training/MAHmed16
            for m = 1:32,
                % Only test word of test utterance
                eval(['load ' int2str(females(m)) word ])
                [S,D,Pp,d] = fst_tstNEY(c,v,len,T,P);
                Dist(count,m) = mean(D);
            end;
        end;

        raw = [];
    end;
end;

save mahMEDall Dist count;


% -----------------------------------------------------------------
% Filename: zfstnnNEYmah2.m
% Tests all test data against each speaker's
% Mahalanobis distance mean ordered template.
% Results placed in mahMEANall.mat
%
% -----------------------------------------------------------------

% Get test utterances
format = '%s';
fid = fopen('/home/bachl/ezeek/YOHO/Scripts/yohowords.list','r');
for i = 1:16,
    eval(['word' int2str(i) ' = fscanf(fid,format,1);'])
end;

load /home/bachl/ezeek/YOHO/SID/females.list
Dist = zeros(1000,32);
count = 0;

for j = 1:32,                % which speakers to use in verification test
    for k = 1:1,             % which words to verify against
        eval(['cd /home/bachl/ezeek/YOHO/verify/' int2str(females(j)) '/word'])
        eval(['word = word' int2str(k) ';'])
        eval(['load ' word '.num;'])
        eval(['frames = ' word ';'])
        eval(['load ' word '.dat;'])
        eval(['raw = ' word ';'])
        nSeq = size(frames,1);
        cum = cumsum(frames);

        for l = 1:nSeq,      % number of word instances
            for p = 1:4,
                data(p,:) = raw(l+((p-1)*nSeq),:);
            end;

            if l > 1,
                eframe = (nSeq*4+1)+cum(l-1)-(4*(l-1));
            else
                eframe = (nSeq*4+1);
            end;

            sframe = eframe + frames(l) - 4;
            for q = 5:frames(l),
                sframe = sframe - 1;
                data(q,:) = raw(sframe,:);
            end;

            P = data(1:frames(l),:);

            % Now test this utterance against all training utterances
            count = count + 1;

            cd /home/bachl/ezeek/YOHO/FST/Training/MAHmean16
            for m = 1:32,
                % Only test word of test utterance
                eval(['load ' int2str(females(m)) word ])
                [S,D,Pp,d] = fst_tstNEY(c,v,len,T,P);
                Dist(count,m) = mean(D);
            end;
        end;

        raw = [];
    end;
end;

save mahMEANall Dist count;


% -----------------------------------------------------------------
% Filename: zfstnnNEYeuc1.m
% Tests all test data against each speaker's
% template developed using Euclidean distance with
% median ordering. Results placed in eucMEDall.mat
% -----------------------------------------------------------------

format = '%s';
fid = fopen('/home/bachl/ezeek/YOHO/Scripts/yohowords.list','r');
for i = 1:16,
    eval(['word' int2str(i) ' = fscanf(fid,format,1);'])
end;

load /home/bachl/ezeek/YOHO/SID/females.list
Dist = zeros(1000,32);
count = 0;

for j = 1:32,                % which speakers to use in verification test
    for k = 1:1,             % which words to verify against
        eval(['cd /home/bachl/ezeek/YOHO/verify/' int2str(females(j)) '/word'])
        eval(['word = word' int2str(k) ';'])
        eval(['load ' word '.num;'])
        eval(['frames = ' word ';'])
        eval(['load ' word '.dat;'])
        eval(['raw = ' word ';'])
        nSeq = size(frames,1);
        cum = cumsum(frames);

        for l = 1:nSeq,      % number of word instances
            for p = 1:4,
                data(p,:) = raw(l+((p-1)*nSeq),:);
            end;

            if l > 1,
                eframe = (nSeq*4+1)+cum(l-1)-(4*(l-1));
            else
                eframe = (nSeq*4+1);
            end;

            sframe = eframe + frames(l) - 4;
            for q = 5:frames(l),
                sframe = sframe - 1;
                data(q,:) = raw(sframe,:);
            end;

            P = data(1:frames(l),:);

            % Now test this utterance against all training utterances
            count = count + 1;

            cd /home/bachl/ezeek/YOHO/FST/Training/VQmed16
            for m = 1:32,
                % Only test word of test utterance
                eval(['load ' int2str(females(m)) word ])
                [S,D,Pp,d] = fst_tstNEY(c,v,len,T,P);
                Dist(count,m) = mean(D);
            end;
        end;

        raw = [];
    end;
end;

save eucMEDall Dist count;


% -----------------------------------------------------------------
% Filename: zfstnnNEYeuc2.m
% Tests all test data against each speaker's
% Euclidean distance mean ordering template.
% Results placed in eucMEANall.mat
%
% -----------------------------------------------------------------

% Get test utterances
format = '%s';
fid = fopen('/home/bachl/ezeek/YOHO/Scripts/yohowords.list','r');
for i = 1:16,
    eval(['word' int2str(i) ' = fscanf(fid,format,1);'])
end;

load /home/bachl/ezeek/YOHO/SID/females.list
Dist = zeros(1000,32);
count = 0;

for j = 1:32,                % which speakers to use in verification test
    for k = 1:1,             % which words to verify against
        eval(['cd /home/bachl/ezeek/YOHO/verify/' int2str(females(j)) '/word'])
        eval(['word = word' int2str(k) ';'])
        eval(['load ' word '.num;'])
        eval(['frames = ' word ';'])
        eval(['load ' word '.dat;'])
        eval(['raw = ' word ';'])
        nSeq = size(frames,1);
        cum = cumsum(frames);

        for l = 1:nSeq,      % number of word instances
            for p = 1:4,
                data(p,:) = raw(l+((p-1)*nSeq),:);
            end;

            if l > 1,
                eframe = (nSeq*4+1)+cum(l-1)-(4*(l-1));
            else
                eframe = (nSeq*4+1);
            end;

            sframe = eframe + frames(l) - 4;
            for q = 5:frames(l),
                sframe = sframe - 1;
                data(q,:) = raw(sframe,:);
            end;

            P = data(1:frames(l),:);

            % Now test this utterance against all training utterances
            count = count + 1;

            cd /home/bachl/ezeek/YOHO/FST/Training/VQmean16
            for m = 1:32,
                % Only test word of test utterance
                eval(['load ' int2str(females(m)) word ])
                [S,D,Pp,d] = fst_tstNEY(c,v,len,T,P);
                Dist(count,m) = mean(D);
            end;
        end;

        raw = [];
    end;
end;

save eucMEANall Dist count;